Diverse video captioning through latent variable expansion

Abstract

• A diverse captioning model of fully convolutional design is proposed.
• We develop a new evaluation metric to assess sentence diversity.
• Our method achieves superior performance compared with state-of-the-art benchmarks.

Automatically describing video content with a text description is a challenging but important task, which has been attracting a lot of attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework for doing so. Concretely, given a video, the intermediate latent variables of the conventional encode-decode process are utilized as input to a conditional generative adversarial network (CGAN) for the purpose of generating diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and as our discriminator, which assesses the quality of the generated sentences. Simultaneously, a new metric, DCE, is designed to evaluate the diverse captions. We evaluate our method on benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results against other methods.
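This page does not include the authors' code, but the framework the abstract describes can be sketched in a few lines. The sketch below is a minimal illustration under assumptions, not the paper's implementation: the module names (ConvGenerator, ConvDiscriminator), layer choices, and all dimensions are invented for the example. It only shows how a convolutional generator conditioned on an encode-decode latent variable plus sampled noise can emit multiple candidate captions, while a convolutional discriminator scores them against the same video latent.

```python
# Minimal sketch (assumed architecture, not the authors' code) of a CGAN over
# encoder-decoder latent variables for diverse video captioning.
import torch
import torch.nn as nn

LATENT, NOISE, VOCAB, MAXLEN = 512, 128, 10000, 20  # assumed sizes

class ConvGenerator(nn.Module):
    """Produces word logits conditioned on a video latent plus noise."""
    def __init__(self):
        super().__init__()
        self.expand = nn.Linear(LATENT + NOISE, 256 * MAXLEN)
        self.convs = nn.Sequential(
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, VOCAB, kernel_size=3, padding=1),
        )

    def forward(self, latent, noise):
        # Different noise samples for the same latent yield different captions.
        h = self.expand(torch.cat([latent, noise], dim=-1))
        return self.convs(h.view(-1, 256, MAXLEN))  # (batch, vocab, time)

class ConvDiscriminator(nn.Module):
    """Scores caption quality, conditioned on the same video latent."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(VOCAB, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.score = nn.Linear(256 + LATENT, 1)

    def forward(self, word_probs, latent):
        h = self.convs(word_probs).squeeze(-1)
        return self.score(torch.cat([h, latent], dim=-1))

# One video latent, several noise draws -> several candidate captions.
gen, disc = ConvGenerator(), ConvDiscriminator()
latent = torch.randn(1, LATENT)  # stand-in for the encode-decode latent variable
captions = [gen(latent, torch.randn(1, NOISE)) for _ in range(3)]
quality = disc(torch.softmax(captions[0], dim=1), latent)
```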

Similar resources

Nonlinear Latent Variable Models for Video Sequences

Many high-dimensional time-varying signals can be modeled as a sequence of noisy nonlinear observations of a low-dimensional dynamical process. Given high-dimensional observations and a distribution describing the dynamical process, we present a computationally inexpensive approximate algorithm for estimating the inverse of this mapping. Once this mapping is learned, we can invert it to constru...
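As a rough illustration of the generative model this abstract assumes (a low-dimensional dynamical process observed through a noisy nonlinear map), here is a minimal simulation; the chosen dynamics, observation map, noise levels, and dimensions are all illustrative assumptions rather than the paper's model.

```python
# Simulate x_t = g(z_t) + noise, where the latent z_t follows low-dimensional
# dynamics z_t = f(z_{t-1}) + noise. f, g, and all sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
D_LATENT, D_OBS, T = 2, 100, 50
W = rng.normal(size=(D_OBS, D_LATENT))  # fixed random projection used by g

def f(z):
    # Assumed latent dynamics: a slow planar rotation.
    theta = 0.1
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ z

def g(z):
    # Assumed nonlinear observation map.
    return np.tanh(W @ z)

z = rng.normal(size=D_LATENT)
X = np.empty((T, D_OBS))
for t in range(T):
    z = f(z) + 0.01 * rng.normal(size=D_LATENT)  # process noise
    X[t] = g(z) + 0.05 * rng.normal(size=D_OBS)  # observation noise
# Recovering z_1..z_T from X is the inverse-mapping problem the paper targets.
```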

Diverse Image Captioning via GroupTalk

Generally speaking, different people tend to describe images from various aspects due to their individual, subjective perceptions. As a result, generating appropriate descriptions of images with both diversity and high quality is of great importance. In this paper, we propose a framework called GroupTalk to learn multiple image caption distributions simultaneously and effectively mimic the...

Reconstruction Network for Video Captioning

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...
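The encoder-decoder-reconstructor idea can be compressed into a short sketch: besides the usual forward (video to sentence) loss, a reconstructor maps the decoder's hidden states back toward the video features, contributing a backward (sentence to video) loss. The modules, shapes, and the 0.2 trade-off weight below are assumptions, not RecNet's actual design.

```python
# Hedged sketch of a dual forward/backward training objective; sizes and the
# trade-off weight are assumptions, not RecNet's published configuration.
import torch
import torch.nn as nn

D_VID, D_HID = 512, 512
decoder_states = torch.randn(8, 20, D_HID)  # stand-in decoder hidden states
video_feats = torch.randn(8, D_VID)         # stand-in pooled video features
caption_loss = torch.tensor(1.0)            # stand-in forward cross-entropy loss

# Reconstructor: summarize the caption-side states, then regress video features.
reconstructor = nn.GRU(D_HID, D_VID, batch_first=True)
_, last = reconstructor(decoder_states)
recon_loss = nn.functional.mse_loss(last.squeeze(0), video_feats)

total_loss = caption_loss + 0.2 * recon_loss  # 0.2 is an assumed weight
```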

Captioning Images with Diverse Objects: Supplementary Material

We present additional examples of the NOC model's descriptions on ImageNet images. We first present some examples where the model is able to generate descriptions of an object in different contexts. Then we present several examples to demonstrate the diversity of objects that NOC can describe. We then present examples where the model generates erroneous descriptions and categorize these errors.

Consensus-based Sequence Training for Video Captioning

Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
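The RL objective this abstract refers to is commonly implemented as REINFORCE with a baseline reward subtracted to reduce gradient variance; the sketch below shows that generic pattern. The reward metric and the baseline choice here are placeholders, not the consensus-based construction the paper itself proposes.

```python
# Generic REINFORCE-with-baseline loss for sequence training; the reward metric
# and baseline are placeholders, not the paper's consensus-based method.
import torch

def rl_loss(log_probs, sampled_reward, baseline_reward):
    """log_probs: (batch,) summed log-probabilities of sampled captions.
    sampled_reward: (batch,) metric score (e.g. CIDEr) of sampled captions.
    baseline_reward: (batch,) reward of a baseline, e.g. greedy decoding."""
    advantage = sampled_reward - baseline_reward      # centered reward
    return -(advantage.detach() * log_probs).mean()   # policy-gradient surrogate

log_probs = torch.randn(4, requires_grad=True)        # stand-in sample log-probs
loss = rl_loss(log_probs, torch.rand(4), torch.rand(4))
loss.backward()  # gradients flow only through the log-probabilities
```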

Journal

Journal title: Pattern Recognition Letters

Year: 2022

ISSN: 1872-7344, 0167-8655

DOI: https://doi.org/10.1016/j.patrec.2022.05.021